A decade of urban fires: Portuguese events between 2013 and 2022

This study describes a dataset containing urban fire events that took place in mainland Portugal between 2013 and 2022. The Regulation n.º3317-A/2018, established by the Portuguese National Emergency and Civil Protection Authority (Autoridade Nacional de Emergência e Proteção Civil, ANEPC), defines the Operations Management System (Sistema de Gestão de Operações, SGO). Among other attributions, this system allows to manage the lyfe-cycle of the urban fire events, from ignition to extinction, through the Operations Decision Support System (Sistema de Apoio Ã Decisão Operacional, SADO). This system supports the systematic collection of a minimum set of data on each event. All instances included in the dataset were retrieved from SADO. To make the data suitable for analytic purposes, several pre-processing actions were taken, including the steps of data transformation and cleaning. The dataset was further validated by a set of technical procedures aiming to verify both data correctness and utility. The final dataset provides the most recent multi-year record of Portuguese urban fires including 27 variables on 72641 events.

www.nature.com/scientificdata www.nature.com/scientificdata/ information that otherwise would pass unnoticed. Recently, Jin et al. 8 underlined that the trend in urban fires modelling has shifted from mathematical based to data-driven statistical learning approaches. In particular, Turner et al. 10 examined the use of machine learning techniques to identify risk factors for unintentional house fires. Other current research trend focuses on the spatio-temporal analysis which provides a framework for understanding the distribution, patterns, and dynamics of fires 11 . Despite these methodological advances, there is still a dearth of available information in particular with regard to urban fires 12,13 , which often leads to unreliable results, making it more challenging to adopt and apply evaluation tools in different urban contexts. However, obtaining comprehensive and organized fire data is frequently complicated, since generally requires the participation of local authorities and emergency management and firefighting institutions 14 . Further, raw information is often plagued by poor quality 15 . Dey et al. 15 cite various examples, such as, the utilization of inconsistent abbreviations and acronyms to identify properties, discrepancies in addresses, incomplete dates, and missing timestamps.
Thus, having good quality datasets regarding urban fire occurrences seems crucial in filling up this gap. This type of data represents a valuable tool as their study may provide significant insights into the spatio-temporal dynamics of urban fires, therefore serving as a foundation for to manage risks, model fire occurrence and building spatially integrated operational plans. Collecting urban fire data is a fundamental key step to characterize ignition patterns across time and space and identify factors that contribute to fire events allowing for targeted interventions and more efficient resource allocation. Moreover, accurate data may have the potential to guide future policies in the fire and rescue services. Additionally, the combination of fire data with information on, e.g., weather and social-economic factors can further enhance the understanding of urban fires dynamics. With the integration of spatial planning to manage fire risk becoming a priority for many local authorities 11 , publishing data descriptors on urban fire occurrences indexed both in time and space, such as this one, may support evidence-based policymaking playing a crucial role in promoting the safety and well-being of the communities.
Several studies were already based on at least a subset of the data described in this manuscript. In particular, Bispo et al. 16 model urban fire occurrences at a municipality level, ultimately mapping the probability of urban fires occurrence in mainland Portugal, and Eslamzadeh et al. 17 assessed fire departments performance using a slack-based Data Envelopment Analysis (DEA).
This study aims at providing access to the latest Portuguese Urban Fire Database, both in their raw and clean formats, including detailed information on pre-processing actions and data validation methods. Furthermore, a comprehensive description of the dataset is provided along with the code used to analyse it. Figure 1 represents schematically the study. Raw data were retrieved from ANEPC arquive. There is no historical open access data available on ANEPC website. Only daily occurrences, incompletely reported (i.e., with no data regarding injuries or deaths), are available online. The raw data were transformed and cleaned, and then checked for correctness until pass all validation criteria, making it ready to reuse in several further contexts. In addition, as the code is made available, this study is readily updatable by any user.
The next section describes in detail the used methods. The raw and final datasets are described in the Data Records section. Finally, in the Technical validation section we address the structural and content validation procedures including an exploratory data analysis to validate data further reuse. Fig. 1 Schematic representation of the study. Information on urban fires are registered in SADO plataform directly if the alert comes from the national emergency telephone number 112 or, indirectly, through the district emergency operations command CDOS if the alert is given by the general population or from the fire departments. After registered, the occurrences were downloaded (raw data) and processed (transformed an cleaned). The final fire dataset was validated based on previous domain knowledge and exploratory data analysis (EDA) making it ready for further reuse.

Methods
The Portuguese regulation n.º3317-A/2018 establishes the National Emergency and Civil Protection System Operations Management System (Sistema de Gestão de Operações, SGO). Among other attributions, this system allows the management of emergencies, through the Operations Decision Support System (Sistema de Apoio Ã Decisão Operacional, SADO). Since 2012, this platform supports the systematic collection of a minimum set of data on each emergency event and integrates national occurrences. All events reported to ANEPC must be registered in SADO. The information on a event may arrive to ANEPC through several sources, including the direct report from the population, the national emergency telephone number 112 or the communication from the fire departments. If the alerts arrive from 112, the information is directly inserted in SADO system. For the remain type of alerts, the information is first given to the telecommunications operators of the district emergency operations command (Comando Distrital de Operações de Socorro, CDOS), which then are responsible to register the information in SADO. This system allows to store relevant information on the life-cycle of fire events, from ignition to extinction. Events regarding urban fires are coded by an integer between 2101 and 2130 (depending on the type of affected structure). All instances included in the dataset were extracted from SADO www.nature.com/scientificdata www.nature.com/scientificdata/ and exported to an .xlsx Excel file. The raw urban fire data encompassed 34 features on 72743 events that took place between 2013 and 2022.
To make the data suitable for analytic purposes, several pre-processing actions were taken, including the steps of data transformation, data cleaning and data validation (see details in section Technical Validation). All analyses were carried out using R 18 , an open-source software environment for statistical computing. Firstly, the .xlsx Excel file was read into an R data frame containing all retrieved objects between 2013 and 2022. Data regarding the geographical delimitation of the administrative units of Portugal were obtain from shapefiles downloaded from the open data public certified service https://dados.gov.pt/ 19 that contains the geometry of the administrative divisions of the country, namely the official boundaries of districts, municipalities, and parishes. These files are made available by the Agência para a Modernização Administrativa, I.P. (AMA), a Portuguese public institute entrusted with the task of fostering and advancing administrative modernization in Portugal, which operates under the supervision of the Secretary of State for Digitalization and Administrative Modernization. These geometry data were restricted to mainland Portugal to allow the rigorous mapping of the events. Packages raster 20 and sf 21 were used to read shapefiles and deal with geometry indexed objects.
The raw dataset contained two type POSIXct, 15 type character and 17 type numeric features ( Table 1). The data transformation process included the following steps:

Raw name
New name Type Description  www.nature.com/scientificdata www.nature.com/scientificdata/ • all variables were renamed using English derived natural names, with only lowercase letters; • the class labels of all factors (except fire departments and administrative units names) were recoded, translating the categories labels to English; • administrative units (variables region, subregion, district, municipality and parish) were consistently written, both in fire and geometry data, using lowercase letters, followed by capitalisation; • features with constant values were deleted from the dataset (variables state, family and species); • the observations and locality variables were discarded as no consistent or complete information was retrievable from it; • variables code and codefd were deleted due to bijective correspondences with type and pfd, respectively; • variable period was deleted due to redundancy with open; • variable pfd was recoded, replacing the string "NULL" by "Unkown"; • type character variables were transformed to factors.
At this stage, the dataset included 72743 objects and 26 features (initial 34 minus 8). After transformation, the data were screened for a set of structural validation rules (see section Technical validation). The dataset was then cleaned to correct the detected errors and, subsequently, submitted to a content validation (see section Technical validation) that allowed to improve/update the data transformation and cleaning steps in an iterative way.

Data Records
The urban fire data between 2013 and 2022 were retrieved from SADO. The raw data organized in an.xlsx Excel file (rawdata.xlsx), containing high-level features for each fire record, are available at figshare 22 . Table 1 summarizes the raw dataset features including the original (raw) and given (new) names, the respective type and a summary description for each feature. Table 2 details the final dataset after processing and validating the initial data, as described in

Technical Validation
The purpose of this data descriptor is to offer a wealth of information that can be used to gain new insights, perform detailed analyses, and uncover patterns and trends in urban fire dynamics that, otherwise, would not be possible to study. To achive this goal, a dataset technical validation is mandatory.
Data validation may be generically defined as the process of analysing the quality of data and deciding whether if it satisfies the assumptions based on domain knowledge and if fits the purpose 23 . Furthermore, according to  www.nature.com/scientificdata www.nature.com/scientificdata/ the European Statistical System data validation is the "activity aimed at verifying whether the value of a data item comes from the given (finite or infinite) set of acceptable values" 24 . In this study, the data were checked iteratively for correctness until the final dataset met the desired quality. This data validation process included two steps, in which the second allowed to review the output from the first, improving the validation procedure in an iterative way. The two steps defined, by order, a (1) structural validation in which data were checked against a set of criteria and the validation results were measured, using R package validate 23 , and a (2) content validation in which the dataset was statistically reviewed to improve/update the list of validation rules and ultimately guaranty the desired dataset quality standards. The former was carried out by checking the correctness of each feature based on univariate and multivariate validation rules. The latter was based on descriptive statistics and exploratory data analysis. Table 3 summarizes the validation rules. Two types of validation rules were employed:

Structural validation.
(1) univariate, defined according to the set of acceptable domains for each feature and applied independently to each one of the variables included in the dataset, and (2) multivariate, involving relationships between variables. Univariate validation rules included checks regarding class attribute for all variables. In addition, for numerical and time/date features, we included checks about the values range, and for factors, checks based on code lists containing allowed values. Multivariate validation rules were applied to variables importance, close, length, region, subregion, district, municipality, parish, deaths, major, minor, assist, other, apc and otherv. A total of 67 initial structural validation rules were defined from which the following 13 failed (rules identification according to As a consequence, the following cleaning actions were taken: (1) variable length was replaced by the time difference between the timestamps of fire extinction and fire outbreak (close and open variables); (2) cases with negative fire lengths were removed from the dataset; www.nature.com/scientificdata www.nature.com/scientificdata/ (3) variable importance was retrieved from length, deaths, major and minor according to the respective definition (see Table 2); (4) spatial coordinates (lat and lon) were restricted to Portuguese borders; (5) the number of levels and the names of the administrative units were retrieved from geometry data; (6) district labels were retrived from geometry data according to fire ignition coordinates (lat and lon variables); (7) municipality labels were retrived from geometry data according to fire ignition coordinates (lat and lon variables); (8) parish labels were retrived from geometry data according to fire ignition coordinates (lat and lon variables).
At this stage the dataset included 72743 cases and 26 variables. Action (2) excluded 7 (0.01%) cases (6 cases with negative length plus one missing value). Action (3) corrected 26 (0.04%) low importance, 29159 (40.1%) medium importance and 35106 (48.3%) high importance entries. Action (4) excluded 95 (0.13%) objects (outside Portuguese borders). Action (5) allowed to correct one municipality name (missing accent mark). Finally, actions (6) to (8), corrected 25 (0.034%) mislabelled objects regarding their district, 239 (0.329%) mislabelled events regarding municipality and 1107 (1.522%) mislabelled objects regarding parish. These errors were confirmed by mapping the coordinates of mislabelled administrative units and checking visually that points fell outside the given administrative unit. Correction was carried out by spatial overlay between the dataset coordinates and the geometry data, finding for each data entry the true geometric unit in which each spatial point fell. This was done for all districts, municipalities and parishes, followed by mapping coordinates to validate the correction and confirm visually that points were labelled under the correct administrative unit names. Figure 2 shows the corrections regarding the district labels, whereas Fig. 3 exemplifies the municipality name correction between Loures and Vila Franca de Xira municipalities.
At this stage the dataset included 72641 cases and 26 variables. No missing values were detected, all the validation rules passed and we proceeded with the content validation, which is detailed in the following subsection.

Content validation.
In this section we present the content validation of the dataset by means of descriptive data analysis aiming to (1) update (if needed) the validation rules used in the structural validation, and (2) ultimately, by properly describing the data, guaranty the desired dataset quality standards. initial summary statistics. The first step towards content validation included verifying the summary statistics for all the variables to check if the initial structural validation was enough to guaranty the desired data quality. 10  www.nature.com/scientificdata www.nature.com/scientificdata/ This analysis showed suspicious ranges for variables length, other (number of other human injuries) and otherv (total number of civilian victims) with maximum values of 2.9 years, 900 and 915, respectively.
All data entries with length above 1 day were checked individual and manually. A set of 173 entries were found to be incorrect regarding variable length. As a consequence, these values and the corresponding closing dates/times were replaced by NA values (approximately 0.24%).
Entries regarding the total number of civilian victims were checked and found to be correct. These figures reflect the evacuation of key infrastructure sites or those in close proximity to them. Examples of such critical infrastructure include schools, places of worship, and commercial areas. The majority of these numbers are directly linked to the displacement of individuals, i.e., those who have been evacuated from these sites.
In addition, 363 entries were found to have simultaneously zero firefighters and technical resources (variables groundhr, airhr, groundtr and airtr), which is not possible. These were all replaced by NA values.
Exploratory data analysis. As mention previously one of the current main research trends focuses on the spatio-temporal analysis to study the patterns through which urban fires are distributed, structured and changed. Hence, the correctness of these two data components is of great importance. The data here made available pertains to the time frame between 2013 and 2022. During this decade, the number of urban fires per year was relatively uniform, ranging from 6764 in 2022 to 7784 in 2013, with an annual average of 7264 per year (Fig. 4a). Figure 4b shows a clear annual pattern, with the number of urban fires typically higher between November and March (≥9% of total events), corresponding to the autumn and winter seasons. Throughout the weekdays the distribution is nearly uniform (Fig. 4c), with a mild higher incidence by the end of the week (Thursday and Friday) and lower values during the weekend. Figure 4d shows that most of the occurrences happen during late afternoon and night-time hours, more precisely between 5 pm and 9 pm. Figure 5 shows the spatial distribution of urban fires and fire departments per district. Clearly there is an uneven distribution of fire incidents across Portugal, with a higher frequency along the coast. As expected ? , Lisboa and Porto districts which are characterized by having among the highest population densities in Portugal, lead the incidence ranking of urban fires across districts. The pattern displayed by the number of fire departments closely mirrors this asymmetry.
The vast majority of urban fires occur in residential buildings, representing around 73% of all the events between 2013 and 2022. Among the fires occurring in non-residential urban infrastructures, the top three categories include Industry, Workshops and Warehouses, Degraded or Unoccupied Buildings and Hotels and Restaurants, each one representing a percentage above 5%. Commercial areas and Transport stations represent around 2% of the non-residential fire occurrences. Each one of the remaining type of structures represent less than 1% of the events. The events, occurring in residential infrastructures or not, are mainly observed in the major districts of Lisboa and Porto (Fig. 6). Lisboa and Porto districs aggregate around 18% and 16% of the residential events, respectively. The percentage of residential fires that were observed in the remaining districts ranged between approximately 1 (in Portalegre) to 9% (in Setúbal). While analysing the frequency of fires spatial coordinates, we found that several entries presented exactly the same location (see, as an example, Fig. 7). In many cases, these duplicates occurred because spatial coordinates were not taken exactly but, instead, were endorsed by the parish centroid. Thus, a new categorical variable was added to the dataset-coordinates type (cty variable)-with categories not exact (representing 7% of the cases), if the spatial coordinates of the event matched the parish centroid or exact (93%), otherwise. As a consequence, two new validation rules were added (regarding variable cty, [r 68 ] class attribute = factor and [r 69 ] levels in code list, Table 3). Note that this procedure ensured the dataset to be suitable, as is, to a parish-area level analysis or a point process level analysis, if filtered by this new variable, retaining only, known to be, exact locations.
Around 10% of the events were categorized as having low importance. Approximately 40% and 50% were considered has medium or high importance urban fires. Regarding the fires length, half of the occurrences had a duration under approximately 1 hour and only 10% surpassed 2.3 hours. On average, the occurrences lasted around 1.4 hours.
The dataset also contains important information regarding human and technical resources. Around 24% of the events require less than 5 firefighters and most of them (44%) demanded between 5 and 10 ground human resources. In 1% of the events more than 35 firefighters were necessary to fight the occurrence. Most of the events took between 1 to 3 technical resources (69%). Aerial resources (human or technical) were used in approximately 0.2% of the events.
Additional information about deaths and injuries in urban fire occurrences is equally available in the dataset. In approximately 99% of the occurrences, there were no major injuries, while this value decreased to 93% when considering minor injuries. It should also be noted that 99.6% of the events had no fatal victims.
The above summary description purported to screen data correctness and facilitate data reuse by others. In addition, it may lift the veil on the potential that these data have towards the understanding of the phenomenon of urban fires in Portugal and/or serving as a tool for comparison with the reality in other countries. It should be noted that the provided summary description does not exhausts, nor intended to exhaust, the information retrievable from the dataset or the insights that may arise from cross-referencing the dataset variables with other sources of information (e.g., social-demographic, financial).

Usage Notes
There is a wide array of possible usages based on the dataset now made available that may support a variety of urban firefighting decision-making mostly related with urban fire risk management and risk analysis aiming mitigation. The prime, and perhaps the most direct usage for this dataset, is, through the use of proper statistical methods, allow to map the probability of urban fire occurrence per location, with the data potentially supporting different geographical ranges, meaning different granularity, including parish, municipality, district, subregion and region levels. Furthermore, the combination of the probability of occurrence with the hazards associated with the events (e.g., injuries, deaths) may enable the creation of a full risk map (as a function of the probability of occurrence and local hazard). Within the scope of the probability of occurrence of fire events, the application of machine learning algorithms may further support the prediction of future fire events. Beyond the elaboration of risk maps, the dataset may be used for optimization of (i) geographical location of fire departments;